Name-Ethnicity Classification and Ethnicity-Sensitive Name Matching

Authors

  • Pucktada Treeratpituk
  • C. Lee Giles
Abstract

Personal names are important and common information in many data sources, ranging from social networks and news articles to patient records and scientific documents. They are often used as queries for retrieving records and as key information for linking documents from multiple sources. Matching personal names can be challenging because of variations in spelling and formatting. While many approximate name-matching techniques have been proposed, most are generic string-matching algorithms. Unlike other types of proper names, personal names are highly cultural: many ethnicities have their own naming systems and identifiable characteristics. In this paper we explore these relationships between ethnicities and personal names to improve name-matching performance. First, we propose a name-ethnicity classifier based on multinomial logistic regression. Our model identifies name-ethnicity from personal names in Wikipedia, which we use to define name-ethnicity, with 85% accuracy. Next, we propose a novel alignment-based name-matching algorithm, based on the Smith–Waterman algorithm and logistic regression. Different name-matching models are then trained for different name-ethnicity groups. Preliminary experiments on DBLP's disambiguated author dataset yield 99% precision and 89% recall. Surprisingly, textual features carry more weight than phonetic ones in name-ethnicity classification.
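The abstract names the Smith–Waterman algorithm as one ingredient of the matching approach but gives no details. As a rough illustration only, the following sketch computes a plain Smith–Waterman local-alignment score between two name strings and normalizes it to the range 0 to 1. The scoring parameters (`match`, `mismatch`, `gap`) and the helper `name_similarity` are assumptions made for this example; they are not taken from the paper, and the paper's actual algorithm may differ substantially.

```python
def smith_waterman(a: str, b: str, match: int = 2,
                   mismatch: int = -1, gap: int = -1) -> int:
    """Return the best local-alignment score between strings a and b.

    Scoring values are illustrative assumptions, not the paper's settings.
    """
    a, b = a.lower(), b.lower()
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the diagonal with a match or a mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def name_similarity(a: str, b: str, match: int = 2) -> float:
    """Hypothetical helper: scale the score by the best possible score
    for the shorter name, giving a value between 0 and 1."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    return smith_waterman(a, b, match=match) / (match * shorter)


if __name__ == "__main__":
    for pair in [("Giles", "Gyles"), ("Smith", "Smith"), ("Lee", "Treeratpituk")]:
        print(pair, round(name_similarity(*pair), 2))
```

A score of this kind could serve as one input feature to a logistic regression that decides whether two names match, with a separate model trained per name-ethnicity group as the abstract describes. How the paper actually combines the two is not stated here.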

Similar resources

Science and Ethnicity: How Ethnicities Shape the Evolution of Computer Science Research Community

Globalization and the World Wide Web have resulted in academia and science being an international and multicultural community forged by researchers and scientists with different ethnicities. How ethnicity shapes the evolution of membership, status and interactions of the scientific community, however, is not well understood. This is due to the difficulty of ethnicity identification at the large ...

A "Roziah" by any other name: a simple Bayesian method for determining ethnicity from names.

Correct identification of ethnicity is central to many epidemiologic analyses. Unfortunately, ethnicity data are often missing. Successful classification typically relies on large databases (n > 500,000 names) of known name-ethnicity associations. We propose an alternative naïve Bayesian strategy that uses substrings of full names. Name and ethnicity data for Malays, Indians, and Chinese were p...

A Review of Name-based Ethnicity Classification Methods and their Potential in Population Studies

BACKGROUND: Several approaches have been proposed to classify populations into ethnic groups using people’s names, as an alternative to ethnicity self-identification information when this is not available. These methodologies have been developed, primarily in the Public Health and Population Genetics literature in different countries, in isolation from and with little participation from demogra...

An ontology of ethnicity based upon personal names with implications for neighbourhood profiling

Understanding of the nature and detailed composition of ethnic groups remains key to a vast swathe of social science and human natural science. Yet ethnic origin is not easy to define, much less measure, and ascribing ethnic origins is one of the most contested and unstable research concepts of the last decade not only in the social sciences, but also in human biology and medicine. As a result,...

Use of name recognition software, census data and multiple imputation to predict missing data on ethnicity: application to cancer registry records

BACKGROUND: Information on ethnicity is commonly used by health services and researchers to plan services, ensure equality of access, and for epidemiological studies. In common with other important demographic and clinical data it is often incompletely recorded. This paper presents a method for imputing missing data on the ethnicity of cancer patients, developed for a regional cancer registry in...

Publication date: 2012